Airline satisfaction is a crucial aspect of the aviation industry because it directly affects customer loyalty and retention. Airline satisfaction data collected from Kaggle is used for exploratory data analysis to better understand passenger preferences in airline services. Several classification machine learning algorithms are then applied to predict passenger satisfaction. With the insights and conclusions from this analysis, we seek to improve airline services and increase passenger satisfaction.
The dataset can be found here: Airline Passenger Satisfaction Dataset
Data understanding is critical for knowing what data is available for mining and for avoiding unexpected problems during data preparation. This phase involves collecting the data, describing and exploring it, and finally verifying its quality. Each step is elaborated below.
# load library
library(ggcorrplot)
library(lattice)
library(stringr)
library(dplyr)
library(data.table)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(ggrepel)
library(GGally)
library(vip)
library(patchwork)
library(magrittr)
library(caret)
library(devtools)
library(caTools)
library(party)
library(e1071)
library(class)
The original dataset was deliberately made unclean in Microsoft Excel by modifying and removing some values and replacing others with inaccurate data. This modified dataset is used to demonstrate data cleaning later.
airline_df <- read.csv("Airline Passenger Satisfaction Dataset\\train_unclean.csv"
, header = TRUE
, stringsAsFactors = TRUE)
Attributes and Dimension of the dataset
colnames(airline_df)
## [1] "X" "id"
## [3] "Gender" "Customer.Type"
## [5] "Age" "Type.of.Travel"
## [7] "Class" "Flight.Distance"
## [9] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [11] "Ease.of.Online.booking" "Gate.location"
## [13] "Food.and.drink" "Online.boarding"
## [15] "Seat.comfort" "Inflight.entertainment"
## [17] "On.board.service" "Leg.room.service"
## [19] "Baggage.handling" "Checkin.service"
## [21] "Inflight.service" "Cleanliness"
## [23] "Departure.Delay.in.Minutes" "Arrival.Delay.in.Minutes"
## [25] "satisfaction"
dim(airline_df)
## [1] 103904 25
The dataset contains 103,904 entries and 25 columns: a row index (X), an id, 22 descriptive attributes and the satisfaction label. The attributes cover various aspects of the flight experience, such as in-flight Wi-Fi service, ease of online booking, gate location, food and drink, seat comfort, in-flight entertainment, on-board service, legroom, baggage handling, check-in service and cleanliness; passenger demographics such as gender, age and travel type; and departure and arrival delay times.
First 6 rows in the dataset
head(airline_df)
## X id Gender Customer.Type Age Type.of.Travel Class
## 1 0 70172 Male Loyal customer 13 Personaltravel Eco Plus
## 2 1 5047 Male disloyal Customer 25 Businesstravel Business
## 3 2 110028 femALe Loyal Customer 26 Business travel Business
## 4 3 24026 Female Loyal Customer 25 Business travel Business
## 5 4 119299 Male Loyal Customer 61 Business travel Business
## 6 5 111157 Female Loyal Customer 26 Personal Travel Ekon
## Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient
## 1 460 3.0 4
## 2 235 3.5 2
## 3 1142 2.0 2
## 4 562 2.0 5
## 5 214 3.0 3
## 6 1180 3.5 4
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 1 5.0 3
## 2 3 3 1.0 3
## 3 2 2 5.0 5
## 4 5 5 2.0 2
## 5 3 3 4.0 5
## 6 2 1 1.2 2
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 5 5 4 3
## 2 1 1 1 5
## 3 7 5 4 3
## 4 8 2 2 5
## 5 9 3 3 4
## 6 10 1 3 4
## Baggage.handling Checkin.service Inflight.service Cleanliness
## 1 4 4 5 5
## 2 3 1 4 1
## 3 4 4 4 5
## 4 3 1 4 2
## 5 4 3 3 3
## 6 4 4 4 1
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## 1 25 18 neutral or dissatisfied
## 2 1 6 neutral or dissatisfied
## 3 0 0 satisfied
## 4 11 9 neutral or dissatisfied
## 5 0 0 satisfied
## 6 0 0 neutral or dissatisfied
Data Types for each attribute in the dataset
sapply(airline_df, class)
## X id
## "integer" "integer"
## Gender Customer.Type
## "factor" "factor"
## Age Type.of.Travel
## "integer" "factor"
## Class Flight.Distance
## "factor" "integer"
## Inflight.wifi.service Departure.Arrival.time.convenient
## "numeric" "integer"
## Ease.of.Online.booking Gate.location
## "integer" "integer"
## Food.and.drink Online.boarding
## "numeric" "integer"
## Seat.comfort Inflight.entertainment
## "integer" "integer"
## On.board.service Leg.room.service
## "integer" "integer"
## Baggage.handling Checkin.service
## "integer" "integer"
## Inflight.service Cleanliness
## "integer" "integer"
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## "integer" "integer"
## satisfaction
## "factor"
The data is composed of categorical (nominal and ordinal) and numerical (continuous) variables, which serve as a basis for understanding passenger behavior and satisfaction levels.
Summary of Dataset
summary(airline_df)
## X id Gender Customer.Type
## Min. : 0 Min. : 1 femALe: 28 disloyal Customer:18981
## 1st Qu.: 25976 1st Qu.: 32534 Female:52699 Loyal customer : 37
## Median : 51952 Median : 64857 mAlE : 49 Loyal Customer :84886
## Mean : 51952 Mean : 64924 Male :51128
## 3rd Qu.: 77927 3rd Qu.: 97368
## Max. :103903 Max. :129880
##
## Age Type.of.Travel Class Flight.Distance
## Min. : 7.00 Business travel:71622 Bisnes : 34 Min. : 31
## 1st Qu.:27.00 Businesstravel : 33 Business:49631 1st Qu.: 414
## Median :40.00 Personal Travel:32176 Eco :20315 Median : 843
## Mean :39.38 Personaltravel : 73 Eco Plus: 7494 Mean :1189
## 3rd Qu.:51.00 Economy :26388 3rd Qu.:1743
## Max. :85.00 Ekon : 42 Max. :4983
##
## Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000
## Median :3.000 Median :3.00 Median :3.000
## Mean :2.731 Mean :3.06 Mean :2.757
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :8.000 Max. :5.00 Max. :5.000
##
## Gate.location Food.and.drink Online.boarding Seat.comfort
## Min. :0.000 Min. :0.000 Min. :0.00 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median :3.00 Median : 4.000
## Mean :2.977 Mean :3.203 Mean :3.25 Mean : 3.443
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.: 5.000
## Max. :5.000 Max. :5.500 Max. :5.00 Max. :20.000
##
## Inflight.entertainment On.board.service Leg.room.service Baggage.handling
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.358 Mean :3.382 Mean :3.351 Mean :3.632
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## Checkin.service Inflight.service Cleanliness Departure.Delay.in.Minutes
## Min. :0.000 Min. :0.00 Min. :0.000 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:3.00 1st Qu.:2.000 1st Qu.: 0.00
## Median :3.000 Median :4.00 Median :3.000 Median : 0.00
## Mean :3.304 Mean :3.64 Mean :3.286 Mean : 14.83
## 3rd Qu.:4.000 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.: 12.00
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :1592.00
## NA's :92
## Arrival.Delay.in.Minutes satisfaction
## Min. : 0.00 neutral or dissatisfied:58879
## 1st Qu.: 0.00 satisfied :45025
## Median : 0.00
## Mean : 15.18
## 3rd Qu.: 13.00
## Max. :1584.00
## NA's :310
The purpose of data checking is to ensure that the data gathered is
accurate, consistent, and complete to perform reliable analysis in the
later stage.
Check duplicate data using the id attribute (unique
identifier)
sum(duplicated(airline_df$id))
## [1] 0
Check missing values in each attribute
colSums(is.na(airline_df))
## X id
## 0 0
## Gender Customer.Type
## 0 0
## Age Type.of.Travel
## 0 0
## Class Flight.Distance
## 0 0
## Inflight.wifi.service Departure.Arrival.time.convenient
## 0 0
## Ease.of.Online.booking Gate.location
## 0 0
## Food.and.drink Online.boarding
## 0 0
## Seat.comfort Inflight.entertainment
## 0 0
## On.board.service Leg.room.service
## 0 0
## Baggage.handling Checkin.service
## 0 0
## Inflight.service Cleanliness
## 0 0
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## 92 310
## satisfaction
## 0
Check distinct values in each nominal attribute
levels(airline_df$Gender)
## [1] "femALe" "Female" "mAlE" "Male"
levels(airline_df$Customer.Type)
## [1] "disloyal Customer" "Loyal customer" "Loyal Customer"
levels(airline_df$Type.of.Travel)
## [1] "Business travel" "Businesstravel" "Personal Travel" "Personaltravel"
levels(airline_df$Class)
## [1] "Bisnes" "Business" "Eco" "Eco Plus" "Economy" "Ekon"
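The checks above cover duplicates, missing values and nominal levels. The summary also shows ratings above 5 (for example a maximum of 20 for Seat.comfort), so a range check on the rating columns is a natural companion. The following is a minimal sketch on a made-up toy data frame; the helper name `invalid_rating` and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame standing in for airline_df (values are made up for illustration)
toy <- data.frame(
  Inflight.wifi.service = c(3, 3.5, 8, 0),
  Seat.comfort          = c(5, 20, 4, 2),
  Cleanliness           = c(5, 1, 3, 4)
)

# Hypothetical helper: count values outside the 0-5 scale or not whole numbers
invalid_rating <- function(x) sum(x < 0 | x > 5 | x != round(x), na.rm = TRUE)

# One count per rating column
invalid_counts <- sapply(toy, invalid_rating)
print(invalid_counts)
```

On the real data the same helper could be applied to the rating columns only, which would flag the attributes that need the imputation described later.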
Clean data ultimately increases overall productivity and provides the highest-quality information for decision-making. 92 and 310 missing values were detected in the continuous attributes Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes respectively; these are replaced with 0. Incorrect, misspelled and structurally inconsistent values were detected in the nominal attributes Gender, Customer.Type, Type.of.Travel and Class. Out-of-range values were found in some ordinal (rating) attributes and are fixed by imputation. All noisy data are cleaned, formatted and renamed accordingly, and irrelevant attributes are dropped to obtain the final clean data.
Gender
Convert Gender into proper case
airline_df$Gender.C <-
as.factor(str_to_title(airline_df$Gender))
levels(airline_df$Gender.C)
## [1] "Female" "Male"
Customer Type
Convert Customer Type into proper case
airline_df$Customer.Type.C <-
as.factor(str_to_title(airline_df$Customer.Type))
levels(airline_df$Customer.Type.C)
## [1] "Disloyal Customer" "Loyal Customer"
Type of Travel
Correct Type of Travel categories
airline_df$Type.of.Travel.C <-
ifelse(grepl('^b', airline_df$Type.of.Travel, ignore.case = TRUE),
"Business Travel",
"Personal Travel")
airline_df$Type.of.Travel.C <- as.factor(airline_df$Type.of.Travel.C)
levels(airline_df$Type.of.Travel.C)
## [1] "Business Travel" "Personal Travel"
Class
Correct Class categories (every class that does not start with "B", including Eco Plus, is mapped to Eco)
airline_df$Class.C <-
ifelse(grepl('^b', airline_df$Class, ignore.case = TRUE),
"Business", "Eco")
airline_df$Class.C <- as.factor(airline_df$Class.C)
levels(airline_df$Class.C)
## [1] "Business" "Eco"
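The pattern-based rule above sends every label that does not start with "b" to "Eco", which also merges "Eco Plus" into "Eco". If that merge is not wanted, an explicit lookup table is one alternative. This is only a sketch of a different design choice, not what this report does: the name `class_map` and the decision to keep "Eco Plus" separate are assumptions.

```r
# Raw labels as they appear in levels(airline_df$Class)
raw_class <- c("Bisnes", "Business", "Eco", "Eco Plus", "Economy", "Ekon")

# Hypothetical lookup table: each raw spelling maps to exactly one clean label
class_map <- c(
  "Bisnes"   = "Business",
  "Business" = "Business",
  "Eco"      = "Eco",
  "Economy"  = "Eco",
  "Ekon"     = "Eco",
  "Eco Plus" = "Eco Plus"
)

# Recode by name lookup, then rebuild the factor
clean_class <- factor(unname(class_map[raw_class]))
print(levels(clean_class))

# A label missing from the table becomes NA instead of silently falling into "Eco"
unmapped <- unname(class_map["First"])
print(is.na(unmapped))
```

The trade-off is that every new misspelling has to be added to the table, but unexpected labels surface as NA rather than being absorbed into a default category.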
All attributes with rating are considered as ordinal data.
Inflight Wifi Service
Replace ratings above 5 with NA (missing values), then impute the missing
values with the mean of the non-zero ratings
airline_df$Inflight.wifi.service.C <- airline_df$Inflight.wifi.service
airline_df$Inflight.wifi.service.C[airline_df$Inflight.wifi.service.C > 5] <- NA
airline_df$Inflight.wifi.service.C[is.na(airline_df$Inflight.wifi.service.C)] <-
mean(airline_df$Inflight.wifi.service.C
[airline_df$Inflight.wifi.service.C > 0]
, na.rm = TRUE
)
Convert to integer
airline_df$Inflight.wifi.service.C <-
round(airline_df$Inflight.wifi.service.C)
Food and Drink
Replace ratings above 5 with NA (missing values), then impute the missing
values with the mean of the non-zero ratings
airline_df$Food.and.drink.C <- airline_df$Food.and.drink
airline_df$Food.and.drink.C[airline_df$Food.and.drink.C > 5] <- NA
airline_df$Food.and.drink.C[is.na(airline_df$Food.and.drink.C)] <-
mean(airline_df$Food.and.drink.C
[airline_df$Food.and.drink.C > 0]
, na.rm = TRUE
)
Convert to integer
airline_df$Food.and.drink.C <-
round(airline_df$Food.and.drink.C)
Seat Comfort
Replace ratings above 5 with NA (missing values), then impute the missing
values with the mean of the non-zero ratings
airline_df$Seat.comfort.C <- airline_df$Seat.comfort
airline_df$Seat.comfort.C[airline_df$Seat.comfort.C > 5] <- NA
airline_df$Seat.comfort.C[is.na(airline_df$Seat.comfort.C)] <-
mean(airline_df$Seat.comfort.C
[airline_df$Seat.comfort.C > 0]
, na.rm = TRUE
)
Convert to integer
# convert to integer
airline_df$Seat.comfort.C <-
round(airline_df$Seat.comfort.C)
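The same three steps (flag out-of-range values, impute with the mean of the non-zero ratings, round) are repeated for each of the three columns above. As a sketch, they could be wrapped in one helper; the function name `clean_rating` and the toy vector are assumptions made for illustration.

```r
# Hypothetical helper wrapping the three cleaning steps used for each rating column
clean_rating <- function(x, max_rating = 5) {
  x[!is.na(x) & x > max_rating] <- NA    # out-of-range ratings become missing
  fill <- mean(x[!is.na(x) & x > 0])     # mean of the valid, non-zero ratings
  x[is.na(x)] <- fill                    # impute the missing values
  round(x)                               # back to whole-number ratings
}

# Toy vector with a fractional rating (3.5) and an out-of-range rating (8)
toy_rating <- c(3, 3.5, 2, 8, 0, 5)
cleaned <- clean_rating(toy_rating)
print(cleaned)
```

With such a helper, the three columns could be cleaned in one `lapply()` call over their names, which keeps the rule in a single place if the rating scale ever changes.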
All attributes with ratings
Convert all ratings attributes to ordinal data
ratings <- c("Inflight.wifi.service.C", "Food.and.drink.C", "Seat.comfort.C",
"Departure.Arrival.time.convenient", "Ease.of.Online.booking",
"Gate.location", "Online.boarding", "Inflight.entertainment",
"On.board.service", "Leg.room.service", "Baggage.handling",
"Checkin.service", "Inflight.service", "Cleanliness"
)
airline_df[ratings] <- lapply(airline_df[ratings],
factor,
levels=c(0, 1, 2, 3, 4, 5)
)
Departure & Arrival Delay in minutes
Replace NA with 0
delay_cols <- c("Departure.Delay.in.Minutes", "Arrival.Delay.in.Minutes")
airline_df[delay_cols][is.na(airline_df[delay_cols])] <- 0
# rearrange columns
col_order <- c("X", "id", "Gender", "Gender.C", "Customer.Type",
"Customer.Type.C", "Age", "Type.of.Travel", "Type.of.Travel.C",
"Class", "Class.C","Flight.Distance", "Inflight.wifi.service",
"Inflight.wifi.service.C", "Departure.Arrival.time.convenient",
"Ease.of.Online.booking", "Gate.location", "Food.and.drink",
"Food.and.drink.C", "Online.boarding", "Seat.comfort",
"Seat.comfort.C", "Inflight.entertainment", "On.board.service",
"Leg.room.service", "Baggage.handling", "Checkin.service",
"Inflight.service", "Cleanliness", "Departure.Delay.in.Minutes",
"Arrival.Delay.in.Minutes", "satisfaction"
)
airline_df <- airline_df[, col_order]
# drop columns
drop_col <- c("Gender", "Customer.Type", "Type.of.Travel", "Class",
"Inflight.wifi.service", "Food.and.drink", "Seat.comfort")
airline_df[, drop_col] <- list(NULL)
# rename columns
old_colnames <- c("Gender.C", "Customer.Type.C", "Type.of.Travel.C", "Class.C",
"Inflight.wifi.service.C", "Food.and.drink.C", "Seat.comfort.C"
)
new_colnames <- drop_col
setnames(airline_df, old = old_colnames, new = new_colnames)
# drop "X" column
airline_df[, "X"] <- list(NULL)
In this section, we give an overall view of the airline satisfaction dataset by customer demographics and service ratings.
The graphs below give an overview of passenger demographics and characteristics, which will be used in further analysis.
Customer Gender
As seen in the pie chart, the distribution of genders among the passenger population is relatively balanced, with 50.7% of the passengers being female and 49.3% being male (52,727 versus 51,177 records in the summary above). The slight difference of 1.4 percentage points indicates that the data is not skewed towards one gender, allowing for a fair representation of both male and female passengers.
Customer Age
Most customers are either in their twenties or forties. The age distribution of the passenger population shows a clear pattern of increasing numbers as age increases, with a first peak in the twenties. The number then drops slightly in the thirties, before climbing back up to the highest peak in the forties. From there, the number of passengers gradually decreases as age increases. This pattern is clearly visible in the graph, providing an overall picture of the age distribution among the passenger population.
Customer Type
It can be observed that 18% of the customers are disloyal and 82% are loyal. This indicates that most of the customers are regular customers of the airline company. Thus, special loyalty marketing campaigns can be strategised to retain loyal customers for sustaining sales and profits.
Travel Type
Additionally, the data shows that 69% of the customers are traveling for business purposes and 31% are traveling for personal reasons. This provides insight into the type of customers the company is serving, enabling targeted marketing and sales campaigns based on customer segmentation.
Flight Class
Lastly, regarding flight class, the data indicates that 47.8% of the customers are traveling in business class while 52.2% in economy class. This information can be used to understand the type of customers that are traveling in different classes, adjust pricing and inventory management strategies, and understand the revenue generated from different classes of customers.
Flight Departure & Arrival Delay in Minutes
Overall, the data gives valuable insights on customer base and can be used for decision-making in business and improving customer satisfaction. It benefits the airline company with customer profiling, allowing them to create personalized services to their customers that helps in further growing their business.
It is observed that check-in service, ease of online booking, gate location and inflight wifi service are not performing up to passengers’ expectations and need further improvement to satisfy customers.
Now, we will perform analysis from different perspectives, such as customer age, customer gender, customer type, travel type and flight class.
The chart above shows satisfaction counts by age. Satisfied customers are concentrated in the 20–60 age range, whereas dissatisfied customers span a wider range (10–60 years). Older customers were in general more satisfied with all features and rated them higher than younger travelers did. This could indicate that younger travelers are more critical and have higher expectations than the middle-aged and older population.
Different perspectives from Customer Gender
From the gender aspect, overall satisfaction does not differ noticeably
between genders, so we look deeper into specific services by gender.
Online.boarding across Gender
In comparison, female customers are more satisfied with the online boarding service than male customers, as most of their ratings are above 3.
Seat.comfort across Gender
In comparison, female customers are more satisfied with seat comfort than male customers, as most of their ratings are above 3.
Leg.room.service across Gender
Female passengers are more dissatisfied with leg room service than male passengers, giving it lower ratings.
Baggage.handling across Gender
Female passengers are more dissatisfied with the baggage handling service than male passengers. One possible reason is that female passengers may more often carry fragile items such as cosmetic products, so careless handling leaves them unhappy with the service.
Inflight.service across Gender
Female passengers are more dissatisfied with inflight service than male passengers.
Different perspectives from Customer Type
Since the majority of customers are loyal, we focus on the rating distribution of loyal customers. Most services have a similar distribution of ratings in each customer type: the count increases gradually from rating 1 to rating 4, peaks at rating 4 and decreases slightly at rating 5. However, some services do not follow this pattern, namely ease of online booking, gate location and in-flight Wi-Fi service, so we focus on the rating distributions of these 3 services in the graphs below.
Different perspectives from Type of Travel
Business travelers tend to be older and have longer flight distances than personal travelers. This could be due to a number of factors such as the nature of their work requiring more frequent and longer-distance travel, or the fact that older individuals may be more likely to hold positions that involve traveling for business.
Additionally, business travelers may also have a higher disposable income and be more likely to upgrade to premium seating or business class, which can result in longer flight distances.
Furthermore, older travelers may also prefer longer flights as they offer more comfort and amenities.
From the perspective of “Type of Travel”, airlines are advised to pay more attention to four services, namely “In-flight Wi-fi service”, “Ease of Online Booking”, “Departure Arrival time convenient” and “Gate location”, as they are identified as major sources of dissatisfaction among business travelers.
By identifying and addressing these specific areas, Airlines can take effective measures to improve the overall travel experience for their customers, resulting in increased satisfaction and loyalty.
On the other hand, for personal travel, “Departure Arrival Time Convenience” receives notably high ratings, while the other services still have significant room for improvement: the majority of their ratings fall between “1” and “3”, which indicates that these services are not performing as well as they could.
By taking initiative to improve these services, the ratings could be increased from “1” to “3” and from “3” to “5”, indicating a significant increase in passenger satisfaction and loyalty.
Different perspectives from Flight Class
In economy class, there is a higher proportion of passengers aged below 30 and between 60 and 70 years old, whereas in business class, there is a higher proportion of passengers aged between 30 and 60 years old. The demographic differences between economy and business class passengers could be due to various factors such as pricing, travel purpose, life stage and perception of comfort.
Younger passengers might choose Eco class due to cost concerns, as it is typically the most affordable option. They might also be more likely to be traveling for leisure or personal reasons rather than for work, leading them to prioritize cost savings over amenities. Alternatively, younger passengers might prioritize other aspects of their trip, such as destination or duration, and choose the Eco class in order to allocate more of their budget to those factors.
Business class is often chosen by adults for the enhanced comfort it provides, such as more spacious seating and additional amenities. It may also be preferred for the convenience it offers, including priority services at the airport and access to exclusive lounges. The ability to work or relax in a more comfortable setting during long flights can also be a factor in the choice of business class. Additionally, some adults may be attracted to the prestige associated with flying in this class.
Classification is the process of predicting the target label based on features in the dataset. In this project, we build classification models using different algorithms to predict the satisfaction of customers (either satisfied or neutral/dissatisfied) based on available features such as customer demographics, flight details and service ratings. The algorithms used are Logistic Regression, Decision Tree and K-Nearest Neighbors (KNN).
Convert categorical data to numerical data using:
Ordinal Data
Convert Ordinal to Integer
airline_df_m <- airline_df
ord_ratings <- c("Inflight.wifi.service", "Food.and.drink", "Seat.comfort",
"Departure.Arrival.time.convenient", "Ease.of.Online.booking",
"Gate.location", "Online.boarding", "Inflight.entertainment",
"On.board.service", "Leg.room.service", "Baggage.handling",
"Checkin.service", "Inflight.service", "Cleanliness"
)
airline_df_m[ord_ratings] <-
lapply(airline_df_m[ord_ratings],
function(x) as.numeric(as.character(x))
)
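The conversion above goes through `as.character()` before `as.numeric()`. A small sketch of why that detour matters: calling `as.numeric()` directly on a factor returns the position of each level, not the label, which would shift a 0 to 5 scale to 1 to 6. The toy factor below is an assumption for illustration.

```r
# Toy rating factor on the same 0-5 scale used for the rating columns
rating <- factor(c(0, 3, 5), levels = 0:5)

# Direct conversion returns level positions (1-based), not the ratings
level_positions <- as.numeric(rating)
print(level_positions)

# Converting through character recovers the original rating values
rating_values <- as.numeric(as.character(rating))
print(rating_values)
```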
Nominal Data
One-hot encoding to convert each category into a new column and assign a
value as 1 or 0 to the column
# One-Hot encoding
features <- c("Gender", "Customer.Type", "Age", "Type.of.Travel", "Class",
"Flight.Distance", ord_ratings, "Departure.Delay.in.Minutes",
"Arrival.Delay.in.Minutes")
airline_df_m_trsf <-
cbind(data.frame(predict(dummyVars(
~.,
data = airline_df_m),
airline_df_m)
),
satisfaction = airline_df_m$satisfaction
)
airline_df_m_trsf <- subset(airline_df_m_trsf,
select = -c(satisfaction.neutral.or.dissatisfied,
satisfaction.satisfied, id))
# Remove perfectly correlated variables
airline_df_m_trsf <- subset(airline_df_m_trsf,
select = -c(Gender.Male, Customer.Type.Loyal.Customer,
Type.of.Travel.Personal.Travel, Class.Eco))
# Set satisfaction as factors
airline_df_m_trsf$satisfaction <- factor(airline_df_m_trsf$satisfaction)
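To illustrate why one dummy column per nominal attribute is removed after one-hot encoding, here is a minimal base R sketch using `model.matrix()` instead of `dummyVars()`. The toy data frame is made up. Note also that `model.matrix()` drops the first level of each factor (Female, Business), whereas the code above keeps those and drops Male and Eco; either choice removes the redundancy.

```r
# Toy nominal data (values are made up for illustration)
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Female", "Male")),
  Class  = factor(c("Business", "Eco", "Eco", "Business"))
)

# Full one-hot encoding of Gender: one 0/1 column per level
full <- model.matrix(~ Gender - 1, data = toy)
print(colnames(full))

# Every row sums to 1, so either column is fully determined by the other
print(rowSums(full))

# Reference-level encoding: the intercept is removed by hand,
# and the first level of each factor is dropped by model.matrix()
dummies <- model.matrix(~ Gender + Class, data = toy)[, -1, drop = FALSE]
print(colnames(dummies))
```

For a two-level attribute the two one-hot columns are perfectly (negatively) correlated, which is the multicollinearity that logistic regression assumes away.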
Next, we investigate the correlation between the different airline services and overall satisfaction. Flight class and online boarding service show the strongest positive correlation with overall satisfaction, each with a Pearson correlation coefficient just above 0.5. This suggests that passengers in business class and passengers who rate online boarding highly tend to report higher satisfaction.
#Changing Satisfaction attribute to numeric
airline_corr<- airline_df_m_trsf %>%
select(everything()) %>% mutate_if(is.factor,as.numeric)
str(airline_corr)
## 'data.frame': 103904 obs. of 23 variables:
## $ Gender.Female : num 0 0 1 1 0 1 0 1 1 0 ...
## $ Customer.Type.Disloyal.Customer : num 0 1 0 0 0 0 0 0 0 1 ...
## $ Age : num 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel.Business.Travel : num 0 1 1 1 1 0 0 1 1 1 ...
## $ Class.Business : num 0 1 1 1 1 0 0 1 1 0 ...
## $ Flight.Distance : num 460 235 1142 562 214 ...
## $ Inflight.wifi.service : num 3 4 2 2 3 4 2 4 1 4 ...
## $ Departure.Arrival.time.convenient: num 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : num 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : num 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : num 5 1 5 2 4 1 2 3 4 2 ...
## $ Online.boarding : num 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : num 5 1 3 3 3 3 3 3 3 3 ...
## $ Inflight.entertainment : num 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : num 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : num 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : num 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : num 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : num 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : num 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : num 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : num 1 1 2 1 2 1 1 2 1 1 ...
#Creating correlation between different airline services and overall satisfaction.
#Note: [-1] drops the first column (Gender.Female), so it does not appear in the output below.
df_corr<-cor(airline_corr[-1], airline_corr$satisfaction)
colnames(df_corr)[1] <- "Satisfaction"
df_corr
## Satisfaction
## Customer.Type.Disloyal.Customer -0.1876381714
## Age 0.1371673050
## Type.of.Travel.Business.Travel 0.4490004498
## Class.Business 0.5038484626
## Flight.Distance 0.2987797858
## Inflight.wifi.service 0.2838195658
## Departure.Arrival.time.convenient -0.0516006177
## Ease.of.Online.booking 0.1717049785
## Gate.location 0.0006820275
## Food.and.drink 0.2097962131
## Online.boarding 0.5035573216
## Seat.comfort 0.3492676800
## Inflight.entertainment 0.3980594211
## On.board.service 0.3223825215
## Leg.room.service 0.3131308017
## Baggage.handling 0.2477493654
## Checkin.service 0.2361737428
## Inflight.service 0.2447407387
## Cleanliness 0.3051980118
## Departure.Delay.in.Minutes -0.0504942103
## Arrival.Delay.in.Minutes -0.0574353182
## satisfaction 1.0000000000
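In the table above, satisfaction is coded 1 ("neutral or dissatisfied") and 2 ("satisfied"), as shown in the `str()` output, so a positive coefficient means higher values go with satisfied passengers. The short sketch below, on made-up numbers, shows that this 1/2 coding gives the same Pearson correlation as the more conventional 0/1 coding, because the coefficient is unaffected by shifting a variable by a constant.

```r
# Toy data (made up): a service rating and the satisfaction label in two codings
rating <- c(1, 2, 2, 3, 4, 4, 5, 5)
sat_12 <- c(1, 1, 1, 1, 2, 1, 2, 2)   # coding produced by as.numeric() on a two-level factor
sat_01 <- sat_12 - 1                  # conventional 0/1 coding

# Pearson correlation under both codings
r_12 <- cor(rating, sat_12)
r_01 <- cor(rating, sat_01)

# The two coefficients agree
print(all.equal(r_12, r_01))
```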
#Creating correlation heatmap between different attributes.
corr_mat <- round(cor(airline_corr),2)
head(corr_mat)
## Gender.Female Customer.Type.Disloyal.Customer
## Gender.Female 1.00 0.03
## Customer.Type.Disloyal.Customer 0.03 1.00
## Age -0.01 -0.28
## Type.of.Travel.Business.Travel 0.01 0.31
## Class.Business -0.01 -0.09
## Flight.Distance -0.01 -0.23
## Age Type.of.Travel.Business.Travel
## Gender.Female -0.01 0.01
## Customer.Type.Disloyal.Customer -0.28 0.31
## Age 1.00 0.05
## Type.of.Travel.Business.Travel 0.05 1.00
## Class.Business 0.14 0.55
## Flight.Distance 0.10 0.27
## Class.Business Flight.Distance
## Gender.Female -0.01 -0.01
## Customer.Type.Disloyal.Customer -0.09 -0.23
## Age 0.14 0.10
## Type.of.Travel.Business.Travel 0.55 0.27
## Class.Business 1.00 0.47
## Flight.Distance 0.47 1.00
## Inflight.wifi.service
## Gender.Female -0.01
## Customer.Type.Disloyal.Customer -0.01
## Age 0.02
## Type.of.Travel.Business.Travel 0.10
## Class.Business 0.03
## Flight.Distance 0.01
## Departure.Arrival.time.convenient
## Gender.Female -0.01
## Customer.Type.Disloyal.Customer -0.21
## Age 0.04
## Type.of.Travel.Business.Travel -0.26
## Class.Business -0.10
## Flight.Distance -0.02
## Ease.of.Online.booking Gate.location
## Gender.Female -0.01 0.00
## Customer.Type.Disloyal.Customer -0.02 0.01
## Age 0.02 0.00
## Type.of.Travel.Business.Travel 0.13 0.03
## Class.Business 0.11 0.00
## Flight.Distance 0.07 0.00
## Food.and.drink Online.boarding Seat.comfort
## Gender.Female -0.01 0.04 0.03
## Customer.Type.Disloyal.Customer -0.06 -0.19 -0.16
## Age 0.02 0.21 0.16
## Type.of.Travel.Business.Travel 0.06 0.22 0.12
## Class.Business 0.09 0.33 0.23
## Flight.Distance 0.06 0.21 0.16
## Inflight.entertainment On.board.service
## Gender.Female -0.01 -0.01
## Customer.Type.Disloyal.Customer -0.11 -0.06
## Age 0.08 0.06
## Type.of.Travel.Business.Travel 0.15 0.06
## Class.Business 0.20 0.22
## Flight.Distance 0.13 0.11
## Leg.room.service Baggage.handling
## Gender.Female -0.03 -0.04
## Customer.Type.Disloyal.Customer -0.05 0.02
## Age 0.04 -0.05
## Type.of.Travel.Business.Travel 0.14 0.03
## Class.Business 0.21 0.17
## Flight.Distance 0.13 0.06
## Checkin.service Inflight.service Cleanliness
## Gender.Female -0.01 -0.04 -0.01
## Customer.Type.Disloyal.Customer -0.03 0.02 -0.08
## Age 0.04 -0.05 0.05
## Type.of.Travel.Business.Travel -0.02 0.02 0.08
## Class.Business 0.16 0.17 0.14
## Flight.Distance 0.07 0.06 0.09
## Departure.Delay.in.Minutes
## Gender.Female 0.00
## Customer.Type.Disloyal.Customer 0.00
## Age -0.01
## Type.of.Travel.Business.Travel 0.01
## Class.Business -0.01
## Flight.Distance 0.00
## Arrival.Delay.in.Minutes satisfaction
## Gender.Female 0.00 -0.01
## Customer.Type.Disloyal.Customer 0.00 -0.19
## Age -0.01 0.14
## Type.of.Travel.Business.Travel 0.01 0.45
## Class.Business -0.01 0.50
## Flight.Distance 0.00 0.30
ggcorrplot::ggcorrplot(corr_mat, hc.order = TRUE, type = "lower",
lab = TRUE,lab_size =2,
title="Correlations in Airline Passenger Satisfaction Dataset",
legend.title = "Pearson \n Corr")+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1))+
theme(axis.text.y = element_text(vjust = 1,
size = 8, hjust = 1))
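The data is split into training and test sets at a ratio of roughly 80:20. In the run shown below, this gives 81,315 training records (the logistic regression reports 81,314 null degrees of freedom) and 22,589 test records (the total of the confusion matrix).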
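# Splitting dataset
# Note: sample.split() is designed for a vector of labels. Given a whole data
# frame it returns one TRUE/FALSE flag per column, which subset() recycles down
# the rows, so the split is about 80:20 but not stratified by satisfaction.
split <- sample.split(airline_df_m_trsf, SplitRatio = 0.8)
train <- subset(airline_df_m_trsf, split == "TRUE")
test <- subset(airline_df_m_trsf, split == "FALSE")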
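`sample.split()` from caTools is documented as taking a vector of labels, so that both subsets keep the same class ratio. As a sketch of what such a stratified 80:20 split does, the base R version below samples 80% of the row indices within each satisfaction class. The toy data, the seed and all object names are assumptions for illustration and are not the split used for the results in this report.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Toy data (made up): 60 dissatisfied and 40 satisfied passengers
labels <- factor(rep(c("neutral or dissatisfied", "satisfied"), times = c(60, 40)))
toy <- data.frame(x = seq_along(labels), satisfaction = labels)

# Sample 80% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$satisfaction)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Both subsets keep the 60:40 class ratio
print(table(train_toy$satisfaction))
print(table(test_toy$satisfaction))
```

Setting a seed before splitting also makes the record counts and the downstream metrics repeatable between runs.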
We will build the classification models using several classifiers, namely Logistic Regression, Decision Tree and K-Nearest Neighbors (KNN), then evaluate, discuss and compare the results.
Logistic Regression is a classification algorithm that models the probability of the target (dependent) variable as a function of the independent variables, through a linear relationship between the predictors and the log odds. Its main assumptions are an appropriate outcome structure (a binary outcome for binary logistic regression), independence of observations, little or no multicollinearity among the independent variables, linearity between the independent variables and the log odds, and a large sample size.
Fit logistic regression model using the glm (generalized linear
model) function
Model Summary:
logit_model <- glm(satisfaction ~.,
family = binomial(link='logit'),
data = train)
summary(logit_model)
##
## Call:
## glm(formula = satisfaction ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8451 -0.4918 -0.1761 0.3869 3.9981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.240e+00 8.849e-02 -104.412 < 2e-16 ***
## Gender.Female -5.780e-02 2.199e-02 -2.628 0.00858 **
## Customer.Type.Disloyal.Customer -2.011e+00 3.340e-02 -60.197 < 2e-16 ***
## Age -8.510e-03 8.016e-04 -10.617 < 2e-16 ***
## Type.of.Travel.Business.Travel 2.728e+00 3.541e-02 77.042 < 2e-16 ***
## Class.Business 7.367e-01 2.796e-02 26.345 < 2e-16 ***
## Flight.Distance -1.232e-05 1.275e-05 -0.967 0.33378
## Inflight.wifi.service 3.853e-01 1.292e-02 29.819 < 2e-16 ***
## Departure.Arrival.time.convenient -1.197e-01 9.267e-03 -12.917 < 2e-16 ***
## Ease.of.Online.booking -1.450e-01 1.280e-02 -11.323 < 2e-16 ***
## Gate.location 2.435e-02 1.033e-02 2.356 0.01847 *
## Food.and.drink -3.019e-02 1.200e-02 -2.515 0.01189 *
## Online.boarding 6.170e-01 1.157e-02 53.336 < 2e-16 ***
## Seat.comfort 6.792e-02 1.258e-02 5.399 6.72e-08 ***
## Inflight.entertainment 7.020e-02 1.603e-02 4.380 1.19e-05 ***
## On.board.service 2.992e-01 1.148e-02 26.054 < 2e-16 ***
## Leg.room.service 2.538e-01 9.637e-03 26.338 < 2e-16 ***
## Baggage.handling 1.375e-01 1.288e-02 10.672 < 2e-16 ***
## Checkin.service 3.235e-01 9.654e-03 33.505 < 2e-16 ***
## Inflight.service 1.173e-01 1.355e-02 8.654 < 2e-16 ***
## Cleanliness 2.200e-01 1.364e-02 16.129 < 2e-16 ***
## Departure.Delay.in.Minutes 4.025e-03 1.024e-03 3.932 8.41e-05 ***
## Arrival.Delay.in.Minutes -8.631e-03 1.013e-03 -8.518 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111270 on 81314 degrees of freedom
## Residual deviance: 54372 on 81292 degrees of freedom
## AIC: 54418
##
## Number of Fisher Scoring iterations: 6
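The coefficients above are on the log-odds scale, which makes them hard to read directly. As a rough sketch, exponentiating a coefficient gives an odds ratio. The values below are copied by hand from the summary, and the interpretation assumes that "satisfied" is the second factor level and therefore the modeled outcome, as the prediction step in this document implies. For instance, the Online.boarding coefficient of 0.617 corresponds to an odds ratio of about 1.85 per additional rating point, holding the other variables constant.

```r
# Coefficients copied from the model summary above (log-odds scale).
coefs <- c(
  Customer.Type.Disloyal.Customer = -2.011,
  Type.of.Travel.Business.Travel  =  2.728,
  Online.boarding                 =  0.6170,
  Flight.Distance                 = -1.232e-05
)

# exp() turns a log-odds coefficient into an odds ratio:
# values above 1 raise the odds of the modeled outcome, values below 1 lower them.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

With the fitted model in memory, exp(coef(logit_model)) would give the same quantities for every term. Flight.Distance is included here because its p-value of 0.33 in the summary indicates no significant effect, and its odds ratio is correspondingly close to 1.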
Model Prediction & Evaluation
Model evaluation using confusion matrix.
# Predict test data
predict_reg <- predict(logit_model,
subset(test,
select = -c(satisfaction)
),
type = "response")
predict_reg <- ifelse(predict_reg >0.5, "satisfied", "neutral or dissatisfied")
# Confusion matrix
confusionMatrix(factor(predict_reg), test$satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neutral or dissatisfied satisfied
## neutral or dissatisfied 11569 1607
## satisfied 1219 8194
##
## Accuracy : 0.8749
## 95% CI : (0.8705, 0.8792)
## No Information Rate : 0.5661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7442
##
## Mcnemar's Test P-Value : 3.341e-13
##
## Sensitivity : 0.9047
## Specificity : 0.8360
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.8705
## Prevalence : 0.5661
## Detection Rate : 0.5122
## Detection Prevalence : 0.5833
## Balanced Accuracy : 0.8704
##
## 'Positive' Class : neutral or dissatisfied
##
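To make the reported statistics easier to trace, the sketch below recomputes accuracy, sensitivity and specificity from the four counts in the confusion matrix above. The counts are typed in by hand, and "neutral or dissatisfied" is treated as the positive class, as stated in the output.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(11569, 1219,    # column 1: reference "neutral or dissatisfied"
               1607,  8194),   # column 2: reference "satisfied"
             nrow = 2,
             dimnames = list(
               Prediction = c("neutral or dissatisfied", "satisfied"),
               Reference  = c("neutral or dissatisfied", "satisfied")))

# Share of all test passengers classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of truly neutral or dissatisfied passengers predicted as such
sensitivity <- cm[1, 1] / sum(cm[, 1])

# Share of truly satisfied passengers predicted as such
specificity <- cm[2, 2] / sum(cm[, 2])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```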
Decision tree is a non-parametric supervised learning algorithm for classification and prediction. It is structured as a hierarchical, flowchart-like tree where each internal node tests an attribute, each branch represents an outcome of that test, and each leaf node holds a class label.
The source set is split into subsets, and the splitting is repeated on each derived subset; the recursion ends when further splitting no longer adds value to the prediction. The Decision Tree model requires little parameter tuning and gives a clear indication of which attributes are most important for prediction.
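The idea of a split that "adds value" can be made concrete with an impurity measure. The sketch below is an illustration on a toy sample with hypothetical variable names. It uses the Gini impurity reduction familiar from CART-style trees, which is swapped in here for simplicity; ctree itself selects splits with statistical tests rather than Gini impurity.

```r
# Illustrative only: a toy sample, not the airline data.

# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity reduction achieved by splitting `labels` with a logical vector
gini_gain <- function(labels, goes_left) {
  n     <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  gini(labels) -
    (length(left) / n * gini(left) + length(right) / n * gini(right))
}

satisfaction <- c("satisfied", "satisfied", "satisfied",
                  "neutral or dissatisfied", "neutral or dissatisfied",
                  "neutral or dissatisfied")
online_boarding  <- c(5, 4, 4, 2, 1, 3)
gender_is_female <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

# A split that separates the classes perfectly gives a large gain
gain_boarding <- gini_gain(satisfaction, online_boarding >= 4)

# A split that barely separates the classes gives a gain close to zero
gain_gender <- gini_gain(satisfaction, gender_is_female)

print(c(online_boarding = gain_boarding, gender = gain_gender))
```

A tree builder would pick the split with the larger gain and stop once no candidate split yields a worthwhile gain.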
Build the decision tree model using ctree.
dt_model <- ctree(satisfaction ~ ., train)
# plot(dt_model)
Model Prediction & Evaluation
Model evaluation using confusion matrix.
# Predict test data
predict_dt <- predict(dt_model, test)
# Confusion matrix
dt_m <- table(test$satisfaction, predict_dt)
dt_m
## predict_dt
## neutral or dissatisfied satisfied
## neutral or dissatisfied 12423 365
## satisfied 673 9128
dt_ac <- sum(diag(dt_m)) / sum(dt_m)
print(paste('Accuracy =', dt_ac))
## [1] "Accuracy = 0.954048430652087"
The accuracy of the Decision Tree model is around 95%, which is relatively high. From the confusion matrix, the model correctly predicted 9,128 passengers as satisfied and 12,423 passengers as neutral or dissatisfied. It misclassified 365 neutral or dissatisfied passengers as satisfied and 673 satisfied passengers as neutral or dissatisfied.
The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised machine learning algorithm that is used for both classification and regression problems. In classification, the idea is to find the k nearest data points in the training set for a new data point, and then predict the class of the new data point as the majority class among those k nearest neighbors.
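The majority-vote idea described above can be sketched in a few lines of base R. This is a minimal illustration on a hand-made training set, with a hypothetical helper knn_predict_one; it is not how the knn function from the class package is implemented, and tie handling may differ.

```r
# Illustrative only: a tiny hand-made training set, not the airline data.
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c(rep("neutral or dissatisfied", 3),
                    rep("satisfied", 3)))

knn_predict_one <- function(new_point, train_x, train_y, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; which.max returns the first label on a tie
  names(which.max(table(nearest)))
}

pred_a <- knn_predict_one(c(1.5, 1.5), train_x, train_y, k = 3)
pred_b <- knn_predict_one(c(6.5, 6.5), train_x, train_y, k = 3)
print(c(pred_a, pred_b))
```

Because the vote depends on distances, features measured on large scales would dominate the result unless the data are scaled first.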
Fit the KNN model
# Feature scaling
# Normalize the range of independent variables or features of data
train_scale <- scale(train[, 1:22])
test_scale <- scale(test[, 1:22])
# Choose a starting k value using the square root of total observations (rule of thumb)
# sqrt(NROW(train))
# 285.1596
# KNN (try 285, 286)
knn_model_285 <- knn(train = train_scale,
test = test_scale,
cl = train$satisfaction,
k = 285)
knn_model_286 <- knn(train = train_scale,
test = test_scale,
cl = train$satisfaction,
k = 286)
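The feature scaling step above standardizes the training and test sets separately, each with its own mean and standard deviation. A common alternative is to estimate the scaling parameters on the training set only and reuse them for the test set, so both sets are measured on the same scale. The sketch below shows this on simulated data with hypothetical column names; the results reported in this document were produced with the separate scaling, and whether the alternative would change them is not tested here.

```r
# Illustrative only: simulated numeric features, not the airline data.
set.seed(1)
train_num <- data.frame(age      = rnorm(200, mean = 40,   sd = 12),
                        distance = rnorm(200, mean = 1200, sd = 900))
test_num  <- data.frame(age      = rnorm(50,  mean = 45,   sd = 12),
                        distance = rnorm(50,  mean = 1500, sd = 900))

# Estimate the scaling parameters on the training set only
train_scaled <- scale(train_num)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Reuse the training mean and standard deviation for the test set
test_scaled <- scale(test_num, center = train_center, scale = train_sd)

print(round(colMeans(train_scaled), 3))
print(round(colMeans(test_scaled), 3))
```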
Model Prediction & Evaluation
Model evaluation using confusion matrix.
# Confusion Matrix
confusionMatrix(knn_model_285, test$satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neutral or dissatisfied satisfied
## neutral or dissatisfied 12252 1596
## satisfied 536 8205
##
## Accuracy : 0.9056
## 95% CI : (0.9017, 0.9094)
## No Information Rate : 0.5661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8054
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9581
## Specificity : 0.8372
## Pos Pred Value : 0.8847
## Neg Pred Value : 0.9387
## Prevalence : 0.5661
## Detection Rate : 0.5424
## Detection Prevalence : 0.6130
## Balanced Accuracy : 0.8976
##
## 'Positive' Class : neutral or dissatisfied
##
confusionMatrix(knn_model_286, test$satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neutral or dissatisfied satisfied
## neutral or dissatisfied 12252 1600
## satisfied 536 8201
##
## Accuracy : 0.9054
## 95% CI : (0.9016, 0.9092)
## No Information Rate : 0.5661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.805
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9581
## Specificity : 0.8368
## Pos Pred Value : 0.8845
## Neg Pred Value : 0.9387
## Prevalence : 0.5661
## Detection Rate : 0.5424
## Detection Prevalence : 0.6132
## Balanced Accuracy : 0.8974
##
## 'Positive' Class : neutral or dissatisfied
##
The results show that the accuracy of the KNN model is 90.6% for k = 285 and 90.5% for k = 286.
Both values are well above the No Information Rate of 56.6%, the accuracy obtained by always predicting the majority class, so the model predicts correctly for a high percentage of the test data.
The confusion matrices show a sensitivity of about 96% and a specificity of about 84% for both values of k, so the model identifies neutral or dissatisfied passengers (the positive class) more reliably than satisfied ones.
Performance is marginally better with k = 285, but the difference is very small.
To further improve the KNN model, we search for the optimal k value over k = 1 to 25.
# Improve model and find optimal k value
# i = 1
# k.optm = 1
# for (i in 1:25){
# knn <- knn(train = train_scale,
# test = test_scale,
# cl = train$satisfaction,
# k = i)
# k.optm[i] <- 100 * sum(test$satisfaction == knn)/NROW(test)
# k = i
# cat(k, '=', k.optm[i], '\n')
# }
# 1 = 91.39795
# 2 = 91.00065
# 3 = 92.69611
# 4 = 92.39582
# 5 = 92.86242
# 6 = 92.72845
# 7 = 92.99185
# 8 = 92.73769
# 9 = 92.83932
# 10 = 92.8578
# 11 = 92.89014
# 12 = 92.8116
# 13 = 92.91324
# 14 = 92.86704
# 15 = 92.71921
# 16 = 92.64529
# 17 = 92.64529
# 18 = 92.56214
# 19 = 92.64529
# 20 = 92.66839
# 21 = 92.58062
# 22 = 92.4836
# 23 = 92.51132
# 24 = 92.43278
# 25 = 92.46974
# plot(k.optm, type="b", xlab="K- Value", ylab="Accuracy level")
# Optimal k: 7
# KNN (use 7)
knn_model_7 <- knn(train = train_scale,
test = test_scale,
cl = train$satisfaction,
k = 7)
# Confusion Matrix
confusionMatrix(knn_model_7, test$satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neutral or dissatisfied satisfied
## neutral or dissatisfied 12360 1207
## satisfied 428 8594
##
## Accuracy : 0.9276
## 95% CI : (0.9242, 0.931)
## No Information Rate : 0.5661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8513
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9665
## Specificity : 0.8768
## Pos Pred Value : 0.9110
## Neg Pred Value : 0.9526
## Prevalence : 0.5661
## Detection Rate : 0.5472
## Detection Prevalence : 0.6006
## Balanced Accuracy : 0.9217
##
## 'Positive' Class : neutral or dissatisfied
##
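To compare the classifiers side by side, the sketch below collects the counts from the confusion matrices reported above into one table and ranks the models by test accuracy. The counts are typed in by hand from the outputs in this document; only accuracy is compared, so this is a starting point for discussion rather than a full evaluation.

```r
# Counts copied from the confusion matrices reported above.
# correct = correctly classified test passengers (diagonal of each matrix)
# wrong   = misclassified test passengers (off-diagonal of each matrix)
results <- data.frame(
  model   = c("Logistic Regression", "Decision Tree",
              "KNN (k = 285)", "KNN (k = 286)", "KNN (k = 7)"),
  correct = c(11569 + 8194, 12423 + 9128,
              12252 + 8205, 12252 + 8201, 12360 + 8594),
  wrong   = c(1607 + 1219, 365 + 673,
              1596 + 536, 1600 + 536, 1207 + 428),
  stringsAsFactors = FALSE
)

# Accuracy = correct / (correct + wrong)
results$accuracy <- results$correct / (results$correct + results$wrong)

# Rank the models from most to least accurate
results <- results[order(-results$accuracy), ]
print(results[, c("model", "accuracy")], row.names = FALSE)
```

On these counts the Decision Tree ranks first, followed by KNN with k = 7, the two larger-k KNN models, and Logistic Regression.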
This section summarizes the EDA and modeling results and identifies which services airlines should focus on for each type of traveler.
Overall, airlines should give more attention to four services, “In-flight Wi-Fi service,” “Ease of Online Booking,” “Departure Arrival time convenient,” and “Gate location,” as they are identified as major sources of dissatisfaction among business travelers.
For Personal Travel, the majority of the ratings for these services fall between “1” and “3”, which indicates that these services are not performing as well as they could. Improving them could raise these ratings and, in turn, passenger satisfaction and loyalty.
YuXuan Lim: